A script with all the R code in the chapter can be downloaded here. The Rmd for this chapter can be downloaded here.
The SNOTEL data are available here.
library(MARSS)
library(forecast)
library(ggplot2)
library(ggmap)
library(broom)
The specific formulation of Equation @ref(eq:msscov-covars) creates restrictions on the assumptions regarding the covariate data. You have to assume that your covariate data have no error, which is probably not true. You cannot have missing values in your covariate data, which is also unlikely. You cannot combine instrument time series; for example, you cannot combine data from two temperature recorders with different error rates and biases. Also, what if you have one noisy temperature sensor in the first part of your time series and then switch to a much better sensor in the second half? All these problems require pre-analysis massaging of the covariate data: leaving out noisy and gappy covariate time series and making what can feel like arbitrary choices about which covariate time series to include.
To circumvent these potential problems and allow more flexibility in how we incorporate covariate data, one can instead treat the covariates as components of an auto-regressive process by including them in both the process and observation models. Beginning with the process equation, we can write \[\begin{equation} \begin{gathered} \begin{bmatrix}\xx^{(v)} \\ \xx^{(c)}\end{bmatrix}_t = \begin{bmatrix}\BB^{(v)} & \CC \\ 0 & \BB^{(c)}\end{bmatrix} \begin{bmatrix}\xx^{(v)} \\ \xx^{(c)}\end{bmatrix}_{t-1} + \begin{bmatrix}\uu^{(v)} \\ \uu^{(c)} \end{bmatrix} + \ww_t,\\ \ww_t \sim \MVN\begin{pmatrix}0,\begin{bmatrix}\QQ^{(v)} & 0 \\ 0 & \QQ^{(c)} \end{bmatrix} \end{pmatrix} \end{gathered} (\#eq:mssmiss-marsscovarx) \end{equation}\] The elements with superscript \({(v)}\) are for the \(k\) variate states and those with superscript \({(c)}\) are for the \(q\) covariate states. The dimension of \(\xx^{(c)}\) is \(q \times 1\) and \(q\) is not necessarily equal to \(p\), the number of covariate observation time series in your dataset. Imagine, for example, that you have two temperature sensors and you are combining these data. Then you have two covariate observation time series (\(p=2\)) but only one underlying covariate state time series (\(q=1\)). The matrix \(\CC\) is dimension \(k \times q\), and \(\BB^{(c)}\) and \(\QQ^{(c)}\) are dimension \(q \times q\). The dimension of \(\xx^{(v)}\) is \(k \times 1\), and \(\BB^{(v)}\) and \(\QQ^{(v)}\) are dimension \(k \times k\). The dimension of \(\xx\) is always denoted \(m\). If your process model includes only variates, then \(k=m\), but now your process model includes \(k\) variates and \(q\) covariate states so \(m=k+q\).
Next, we can write the observation equation in an analogous manner, such that \[\begin{equation} \begin{gathered} \begin{bmatrix} \yy^{(v)} \\ \yy^{(c)} \end{bmatrix}_t = \begin{bmatrix}\ZZ^{(v)} & \DD \\ 0 & \ZZ^{(c)} \end{bmatrix} \begin{bmatrix}\xx^{(v)} \\ \xx^{(c)} \end{bmatrix}_t + \begin{bmatrix} \aa^{(v)} \\ \aa^{(c)} \end{bmatrix} + \vv_t,\\ \vv_t \sim \MVN\begin{pmatrix}0,\begin{bmatrix}\RR^{(v)} & 0 \\ 0 & \RR^{(c)} \end{bmatrix} \end{pmatrix} \end{gathered} (\#eq:mssmiss-marsscovary) \end{equation}\] The dimension of \(\yy^{(c)}\) is \(p \times 1\), where \(p\) is the number of covariate observation time series in your dataset. The dimension of \(\yy^{(v)}\) is \(l \times 1\), where \(l\) is the number of variate observation time series in your dataset. The total dimension of \(\mathbf{y}\) is \(l+p\). The matrix \(\mathbf{D}\) is dimension \(l \times q\), \(\mathbf{Z}^{(c)}\) is dimension \(p \times q\), and \(\mathbf{R}^{(c)}\) is dimension \(p \times p\). The matrix \(\mathbf{Z}^{(v)}\) is dimension \(l \times k\), and \(\mathbf{R}^{(v)}\) is dimension \(l \times l\).
The \(\mathbf{D}\) matrix would presumably have a number of all-zero rows in it, as would the \(\mathbf{C}\) matrix. The covariates that affect the states would often be different from the covariates that affect the observations. For example, mean annual temperature might affect population growth rates for many species while having little or no effect on observability, and turbidity might strongly affect observability in many types of aquatic surveys but have little effect on population growth rate.
Our MARSS model with covariates now looks on the surface like a regular MARSS model: \[\begin{equation}
\begin{gathered}
\xx_t = \BB\xx_{t-1} + \uu + \ww_t, \text{ where } \ww_t \sim \MVN(0,\QQ) \\
\yy_t = \ZZ\xx_t + \aa + \vv_t, \text{ where } \vv_t \sim \MVN(0,\RR)
\end{gathered}
\end{equation}\] with the \(\xx_t\), \(\yy_t\) and parameter matrices redefined as in Equations @ref(eq:mssmiss-marsscovarx) and @ref(eq:mssmiss-marsscovary): \[\begin{equation}\label{eqn:marsscovarparams}
\begin{gathered}
\xx=\begin{bmatrix}\xx^{(v)}\\ \xx^{(c)}\end{bmatrix} \quad \BB=\begin{bmatrix}\BB^{(v)} & \CC \\ 0 & \BB^{(c)}\end{bmatrix} \quad \uu=\begin{bmatrix}\uu^{(v)}\\ \uu^{(c)}\end{bmatrix} \quad \QQ=\begin{bmatrix}\QQ^{(v)} & 0 \\ 0 & \QQ^{(c)}\end{bmatrix} \\
\yy=\begin{bmatrix}\yy^{(v)}\\ \yy^{(c)}\end{bmatrix} \quad \ZZ=\begin{bmatrix}\ZZ^{(v)} & \DD \\ 0 & \ZZ^{(c)}\end{bmatrix} \quad \aa=\begin{bmatrix}\aa^{(v)}\\ \aa^{(c)}\end{bmatrix} \quad \RR=\begin{bmatrix}\RR^{(v)} & 0 \\ 0 & \RR^{(c)}\end{bmatrix}
\end{gathered}
(\#eq:mssmiss-marss-covar)
\end{equation}\] Note \(\QQ\) and \(\RR\) are written as block diagonal matrices, but you could allow covariances if that made sense. \(\uu\) and \(\aa\) are column vectors here. We can fit the model (Equation @ref(eq:mssmiss-marss-covar)) as usual using the MARSS() function.
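As a concrete sketch of how these block matrices could be constructed for MARSS(), here is a hypothetical small case with \(k=2\) variate states and \(q=1\) covariate state (so \(m=3\)); all names (`C1`, `C2`, `b.cov`) and dimensions are illustrative, not from the chapter:

```r
library(MARSS)
# Hypothetical sketch: k = 2 variate states, q = 1 covariate state (m = 3)
# B has the C block in the upper right; the lower-left block is 0
B <- matrix(list(0), 3, 3)
B[1:2, 1:2] <- diag(2)       # B^(v): identity (variates are random walks)
B[1:2, 3] <- c("C1", "C2")   # C: effect of covariate state on each variate
B[3, 3] <- "b.cov"           # B^(c): AR(1) coefficient for the covariate state

# Z maps the 3 states to l = 2 variate and p = 1 covariate observations
Z <- matrix(list(0), 3, 3)
Z[1, 1] <- 1; Z[2, 2] <- 1   # Z^(v): identity
Z[3, 3] <- 1                 # Z^(c)
# a D block would go in Z[1:2, 3] if covariates also affected the observations

model.aug <- list(B = B, Z = Z, Q = "diagonal and unequal",
                  R = "diagonal and unequal", U = "zero", A = "zero")
```

Here a simple diagonal \(\QQ\) and \(\RR\) are used; any block-diagonal structure that keeps the variate and covariate blocks separate would also be consistent with the equations above.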
The log-likelihood that is returned by MARSS will include the log-likelihood of the covariates under the covariate state model. If you want only the log-likelihood of the non-covariate data, you will need to subtract off the log-likelihood of the covariate model: \[\begin{equation} \begin{gathered} \xx^{(c)}_t = \BB^{(c)}\xx_{t-1}^{(c)} + \uu^{(c)} + \ww_t, \text{ where } \ww_t \sim \MVN(0,\QQ^{(c)}) \\ \yy^{(c)}_t = \ZZ^{(c)}\xx_t^{(c)} + \aa^{(c)} + \vv_t, \text{ where } \vv_t \sim \MVN(0,\RR^{(c)}) \end{gathered} (\#eq:mssmiss-covar-dummy) \end{equation}\] An easy way to get this log-likelihood for the covariate data only is to use the augmented model (Equation @ref(eq:mssmiss-marsscovary) with terms defined as in Equation @ref(eq:mssmiss-marss-covar)) but pass in missing values for the non-covariate data. The following code shows how to do this.
y.aug = rbind(data, covariates)
fit.aug = MARSS(y.aug, model = model.aug)
fit.aug is the MLE object that can be passed to MARSSkf(). You need to make a version of this MLE object with the non-covariate data filled with NAs so that you can compute the log-likelihood without the covariates. This needs to be done in the marss element since that is what is used by MARSSkf(). Below is code to do this.
fit.cov = fit.aug
fit.cov$marss$data[1:dim(data)[1], ] = NA
extra.LL = MARSSkf(fit.cov)$logLik
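The log-likelihood of the non-covariate data alone is then the difference between the augmented-model log-likelihood and this covariate-only log-likelihood (the name `LL.data` is just illustrative):

```r
# Log-likelihood of the non-covariate data only:
# full augmented-model logLik minus the covariate-only logLik
LL.data <- fit.aug$logLik - extra.LL
```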
Note that when you fit the augmented model, the estimates of \(\mathbf{C}\) and \(\mathbf{B}^{(c)}\) are affected by the non-covariate data since the model for both the non-covariate and covariate data are estimated simultaneously and are not independent (since the covariate states affect the non-covariates states). If you want the covariate model to be unaffected by the non-covariate data, you can fit the covariate model separately and use the estimates for \(\mathbf{B}^{(c)}\) and \(\mathbf{Q}^{(c)}\) as fixed values in your augmented model.
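A sketch of that two-step approach, assuming a single covariate time series in a matrix `covariates` (the object names here are placeholders, not from the chapter):

```r
library(MARSS)
# Step 1: fit an AR(1) model to the covariate data alone
mod.cov <- list(B = matrix("b"), Q = matrix("q"), U = matrix("u"),
                Z = matrix(1), A = matrix(0), R = matrix(0.01))
fit.c <- MARSS(covariates, model = mod.cov)

# Step 2: extract the estimates and place them, as fixed numeric values,
# in the B^(c) and Q^(c) blocks of the augmented model so the
# non-covariate data cannot influence the covariate dynamics
B.c <- coef(fit.c, type = "matrix")$B
Q.c <- coef(fit.c, type = "matrix")$Q
```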
Let’s see an example using the Washington SNOTEL data. The data we will use are the snow water equivalent percent of normal, i.e., the snow water equivalent compared to the average value for that site on the same day. We will look at a subset of sites in the Central Cascades in our snotel dataset (Figure @ref(fig:mssmiss-plotsnotel)).
load("snotel.RData")
y <- snotelmeta
# Just use a subset
y = y[which(y$Longitude < -121.4), ]
y = y[which(y$Longitude > -122.5), ]
y = y[which(y$Latitude < 47.5), ]
y = y[which(y$Latitude > 46.5), ]
(ref:snotelsites) Subset of SNOTEL sites used in this chapter.
For the first analysis, we are just going to look at February Snow Water Equivalent (SWE). Our subset of stations is listed in y$Station.Id. There are many missing years among some of our stations (Figure @ref(fig:mssmiss-plotsnotelts)).
(ref:snotelsites-plot) Snow water equivalent time series from each SNOTEL station.
swe.feb <- snotel
swe.feb <- swe.feb[swe.feb$Station.Id %in% y$Station.Id & swe.feb$Month ==
"Feb", ]
p <- ggplot(swe.feb, aes(x = Date, y = SWE)) + geom_line()
p + facet_wrap(~Station)
Imagine that for our study we need an estimate of SWE for all sites. We will use the information from the sites with full records to estimate the missing SWE values at the other sites, using a MARSS model to draw on all the available data.
\[\begin{equation} \begin{bmatrix} x_1 \\ x_2 \\ \dots \\ x_{15} \end{bmatrix}_t = \begin{bmatrix} b&0&\dots&0 \\ 0&b&\dots&0 \\ \dots&\dots&\dots&\dots \\ 0&0&\dots&b \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \dots \\ x_{15} \end{bmatrix}_{t-1} + \begin{bmatrix} w_1 \\ w_2 \\ \dots \\ w_{15} \end{bmatrix}_{t} \\ \begin{bmatrix} y_1 \\ y_2 \\ \dots \\ y_{15} \end{bmatrix}_t = \begin{bmatrix} x_1 \\ x_2 \\ \dots \\ x_{15} \end{bmatrix}_t + \begin{bmatrix} a_1 \\ a_2 \\ \dots \\ a_{15} \end{bmatrix}_{t} + \begin{bmatrix} v_1 \\ v_2 \\ \dots \\ v_{15} \end{bmatrix}_t (\#eq:mssmiss-ar1) \end{equation}\]
We will use an unconstrained variance-covariance structure for \(\mathbf{w}\) and assume that the \(\mathbf{v}\) are independent and identically distributed with a very low variance (SNOTEL instrument variability). The \(a_i\) determine the level of the \(x_i\).
We need our data to be in rows. We will use reshape2::acast().
dat.feb <- reshape2::acast(swe.feb, Station ~ Year, value.var = "SWE")
We set up the model for MARSS so that it is the same as Equation @ref(eq:mssmiss-ar1). We will fix the measurement error to be small; we could use 0, but the fitting is more stable if we use a small variance instead. When estimating \(\mathbf{B}\), setting the initial value at \(t=1\) instead of \(t=0\) works better.
ns <- length(unique(swe.feb$Station))
B <- "diagonal and equal"
Q <- "unconstrained"
R <- diag(0.01, ns)
U <- "zero"
A <- "unequal"
x0 <- "unequal"
mod.list.ar1 = list(B = B, Q = Q, R = R, U = U, x0 = x0, A = A,
tinitx = 1)
Now we can fit a MARSS model and get estimates of the missing SWEs. Convergence is slow. We set \(\mathbf{a}\) equal to the mean of the time series to speed convergence.
library(MARSS)
m <- apply(dat.feb, 1, mean, na.rm = TRUE)
fit.ar1 <- MARSS(dat.feb, model = mod.list.ar1, control = list(maxit = 5000),
inits = list(A = matrix(m, ns, 1)))
The \(b\) estimate can be extracted from the fitted model with coef(fit.ar1)$B.
Let’s plot the estimated SWEs for the missing years (Figure @ref(fig:mssmiss-snotelplotfits-ar1)). These estimates use the information about the correlation with other sites and the correlation with the prior and subsequent years. We will use the augment() function from the broom package to get the estimated 95% confidence intervals for the estimated states. Notice that for some sites the CIs are narrow in the early years because those sites are highly correlated with sites for which there are data. At other sites, the uncertainty is high in the early years because the sites with data in those years are not highly correlated.
(ref:mssmiss-snotelplotfits-ar1) Estimated SWEs from the expected value of the states \(\hat{x}\) conditioned on all the data. Note we could also use the expected value of the \(y\) conditioned on all the data.
fit <- fit.ar1
d <- augment(fit, interval = "confidence")
d$Year <- d$t + 1980
d$Station <- d$.rownames
p <- ggplot(data = d) + geom_line(aes(Year, .fitted)) + geom_ribbon(aes(x = Year,
ymin = .conf.low, ymax = .conf.up), linetype = 2, alpha = 0.5)
p <- p + geom_point(data = swe.feb, mapping = aes(x = Year, y = SWE))
p + facet_wrap(~Station) + xlab("") + ylab("SWE (demeaned)")
If we were using these SWEs as covariates in a site-specific model, we could use the estimates directly as our covariates; however, this would not incorporate their uncertainty. Alternatively, we could use Equation @ref(eq:mssmiss-marsscovarx) and set the parameters for the covariate process to those estimated for our covariate-only model. This approach incorporates the uncertainty in the SWE estimates in the early years for the sites with no data.
Note, we should do some cross-validation (fitting with data left out) to ensure that the estimated SWEs are well-matched to actual measurements. It would probably be best to do ‘leave-three-out’ instead of ‘leave-one-out’ since the estimate for time \(t\) uses information from \(t-1\) and \(t+1\) (if present).
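A minimal sketch of such a leave-three-out check: blank out three consecutive years at one station (Paradise here, with arbitrary year indices — both are illustrative choices), refit, and compare the estimated states to the held-out data. Comparing states to observations directly is reasonable here because \(\mathbf{Z}\) is the identity in this model.

```r
library(MARSS)
dat.cv <- dat.feb
i <- which(rownames(dat.cv) == "Paradise")  # assumes this row name exists
hold <- 10:12                               # arbitrary years to hold out
dat.cv[i, hold] <- NA
fit.cv <- MARSS(dat.cv, model = mod.list.ar1, control = list(maxit = 5000),
                inits = list(A = matrix(m, ns, 1)))
# Compare held-out observations to the estimated states for those years
cbind(observed = dat.feb[i, hold], estimated = fit.cv$states[i, hold])
```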
The state residuals have a tendency for negative autocorrelation at lag-1 (Figure @ref(fig:mssmiss-stateresids-ar1)).
(ref:mssmiss-stateresids-ar1) State residuals for the AR(1) model. Many stations show negative autocorrelation at lag-1.
fit <- fit.ar1
par(mfrow = c(4, 4), mar = c(2, 2, 1, 1))
apply(residuals(fit)$state.residuals[, 1:30], 1, acf)
Another approach is to treat the February data as temporally uncorrelated. The two longest time series (Paradise and Olallie Meadows) show minimal autocorrelation so we might decide to just use the correlation across stations for our estimates. In this case, the state of the missing SWE values at time \(t\) is the expected value conditioned on all the stations with data at time \(t\) given the estimated variance-covariance matrix \(\mathbf{Q}\).
\[\begin{equation} \begin{bmatrix} x_1 \\ x_2 \\ \dots \\ x_{15} \end{bmatrix}_t = \begin{bmatrix} w_1 \\ w_2 \\ \dots \\ w_{15} \end{bmatrix}_{t}, \, \begin{bmatrix} w_1 \\ w_2 \\ \dots \\ w_{15} \end{bmatrix}_{t} \sim \MVN\begin{pmatrix}0, \begin{bmatrix} \sigma_1&\zeta_{1,2}&\dots&\zeta_{1,15} \\ \zeta_{2,1}&\sigma_2&\dots&\zeta_{2,15} \\ \dots&\dots&\dots&\dots \\ \zeta_{15,1}&\zeta_{15,2}&\dots&\sigma_{15} \end{bmatrix}\end{pmatrix} \\ \begin{bmatrix} y_1 \\ y_2 \\ \dots \\ y_{15} \end{bmatrix}_t = \begin{bmatrix} x_1 \\ x_2 \\ \dots \\ x_{15} \end{bmatrix}_t + \begin{bmatrix} a_1 \\ a_2 \\ \dots \\ a_{15} \end{bmatrix}_{t} + \begin{bmatrix} v_1 \\ v_2 \\ \dots \\ v_{15} \end{bmatrix}_t (\#eq:mssmiss-tempind) \end{equation}\] Again \(\mathbf{a}\) is the mean level in the time series. Note that the expected value of \(\mathbf{x}\) is zero if there are no data, so \(E(\mathbf{x}_0)=0\).
ns <- length(unique(swe.feb$Station))
B <- "zero"
Q <- "unconstrained"
R <- diag(0.01, ns)
U <- "zero"
A <- "unequal"
x0 <- "zero"
mod.list.corr = list(B = B, Q = Q, R = R, U = U, x0 = x0, A = A,
tinitx = 0)
Now we can fit a MARSS model and get estimates of the missing SWEs. Convergence is slow. We set \(\mathbf{a}\) equal to the mean of the time series to speed convergence.
m <- apply(dat.feb, 1, mean, na.rm = TRUE)
fit.corr <- MARSS(dat.feb, model = mod.list.corr, control = list(maxit = 5000),
inits = list(A = matrix(m, ns, 1)))
The estimated SWEs for the missing years use the information about the correlation with other sites only.
fit <- fit.corr
d <- broom::augment(fit, interval = "confidence")
d$Year <- d$t + 1980
d$Station <- d$.rownames
p <- ggplot(data = d) + geom_line(aes(Year, .fitted)) + geom_ribbon(aes(x = Year,
ymin = .conf.low, ymax = .conf.up), linetype = 2, alpha = 0.5)
p <- p + geom_point(data = swe.feb, mapping = aes(x = Year, y = SWE))
p + facet_wrap(~Station) + xlab("") + ylab("SWE (demeaned)")
The model residuals have a tendency for negative autocorrelation at lag-1.
fit <- fit.corr
par(mfrow = c(4, 4), mar = c(2, 2, 1, 1))
apply(residuals(fit)$model.residuals[, 1:30], 1, acf)
Another approach we might take is to model SWE using Dynamic Factor Analysis. Our model might take the following form with two factors, modeled as AR(1) processes. \(\mathbf{a}\) is the mean level of the time series.
\[ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}_t = \begin{bmatrix} b_1&0\\0&b_2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}_{t-1} + \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}_{t} \\ \begin{bmatrix} y_1 \\ y_2 \\ \dots \\ y_{15} \end{bmatrix}_t = \begin{bmatrix} z_{1,1}&0\\z_{2,1}&z_{2,2}\\ \dots&\dots \\z_{15,1}&z_{15,2} \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}_t + \begin{bmatrix} a_1 \\ a_2 \\ \dots \\ a_{15} \end{bmatrix} + \begin{bmatrix} v_1 \\ v_2 \\ \dots \\ v_{15} \end{bmatrix}_t \]
The model is set up as follows:
ns <- dim(dat.feb)[1]
B <- matrix(list(0), 2, 2)
B[1, 1] <- "b1"
B[2, 2] <- "b2"
Q <- diag(1, 2)
R <- "diagonal and unequal"
U <- "zero"
x0 <- "zero"
Z <- matrix(list(0), ns, 2)
Z[1:(ns * 2)] <- c(paste0("z1", 1:ns), paste0("z2", 1:ns))
Z[1, 2] <- 0
A <- "unequal"
mod.list.dfa = list(B = B, Z = Z, Q = Q, R = R, U = U, A = A,
x0 = x0)
Now we can fit a MARSS model and get estimates of the missing SWEs. We pass in the initial value for \(\mathbf{a}\) as the mean level so that the fitting is easier.
library(MARSS)
m <- apply(dat.feb, 1, mean, na.rm = TRUE)
fit.dfa <- MARSS(dat.feb, model = mod.list.dfa, control = list(maxit = 1000),
inits = list(A = matrix(m, ns, 1)))
The state residuals are uncorrelated.
fit <- fit.dfa
par(mfrow = c(1, 2), mar = c(2, 2, 1, 1))
apply(residuals(fit)$state.residuals[, 1:30, drop = FALSE], 1,
acf)
As are the model residuals:
par(mfrow = c(4, 4), mar = c(2, 2, 1, 1))
apply(residuals(fit)$model.residual, 1, function(x) {
acf(na.omit(x))
})
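Having fit all three models to the same February data, we could sketch a quick comparison of their data support with AICc (smaller is better); this is an informal check, not a substitute for the residual diagnostics above:

```r
# AICc for the AR(1), correlation-only, and DFA models
c(ar1 = fit.ar1$AICc, corr = fit.corr$AICc, dfa = fit.dfa$AICc)
```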
When we look at all months, we see that SWE is highly seasonal. Note October and November are missing for all years.
swe.yr <- snotel
swe.yr <- swe.yr[swe.yr$Station.Id %in% y$Station.Id, ]
swe.yr$Station <- droplevels(swe.yr$Station)
Set up the data matrix of monthly SNOTEL data:
dat.yr <- snotel
dat.yr <- dat.yr[dat.yr$Station.Id %in% y$Station.Id, ]
dat.yr$Station <- droplevels(dat.yr$Station)
dat.yr$Month <- factor(dat.yr$Month, levels = month.abb)
dat.yr <- reshape2::acast(dat.yr, Station ~ Year + Month, value.var = "SWE")
We will model the seasonal differences using a periodic model. The covariates are
period <- 12
TT <- dim(dat.yr)[2]
cos.t <- cos(2 * pi * seq(TT)/period)
sin.t <- sin(2 * pi * seq(TT)/period)
c.seas <- rbind(cos.t, sin.t)
We will create a state for the seasonal cycle and each station will have a scaled effect of that seasonal cycle. The observations will have the seasonal effect plus a mean and residuals (observation - season - mean) will be allowed to correlate across stations.
ns <- dim(dat.yr)[1]
B <- "zero"
Q <- matrix(1)
R <- "unconstrained"
U <- "zero"
x0 <- "zero"
Z <- matrix(paste0("z", 1:ns), ns, 1)
A <- "unequal"
mod.list.dfa = list(B = B, Z = Z, Q = Q, R = R, U = U, A = A,
x0 = x0)
C <- matrix(c("c1", "c2"), 1, 2)
c <- c.seas
mod.list.seas <- list(B = B, U = U, Q = Q, A = A, R = R, Z = Z,
C = C, c = c, x0 = x0, tinitx = 0)
Now we can fit the model:
m <- apply(dat.yr, 1, mean, na.rm = TRUE)
fit.seas <- MARSS(dat.yr, model = mod.list.seas, control = list(maxit = 500),
inits = list(A = matrix(m, ns, 1)))
Figure () shows the seasonal estimate plus mean for each station. This is \(z_i x_t + a_i\).
The estimated SWE at each station is \(E(y_i|y_{1:T})\). This is the estimate conditioned on all the data and includes the seasonal component plus the information from the data at the other stations. Because we estimated an \(\mathbf{R}\) matrix with covariance, stations with data at time \(t\) help inform the value of stations without data at time \(t\). Only years up to 1990 are shown, but the model is fit to all years.
# this is the estimate of y conditioned on all the data
dd <- MARSShatyt(fit.seas)$ytT
rownames(dd) <- rownames(dat.yr)
colnames(dd) <- colnames(dat.yr)
ddd <- reshape2::melt(dd)
ddd$Var2 <- factor(ddd$Var2, levels = paste0(rep(1981:2013, each = 12),
"_", month.abb))
colnames(ddd) <- c("Station", "Year_Month", "SWE")
ddd <- ddd[order(ddd$Station, ddd$Year_Month), ]
ddd$Date <- swe.yr$Date
ddd$Year <- swe.yr$Year
ddd <- subset(ddd, Year < 1990)
p <- ggplot(data = ddd) + geom_line(aes(x = Date, y = SWE))
p <- p + geom_point(data = subset(swe.yr, Year < 1990), mapping = aes(x = Date,
y = SWE))
p + facet_wrap(~Station) + xlab("") + ylab("SWE")